Machine Translation of Low-Resource Spoken Dialects: Strategies for Normalizing Swiss German
Authors
Abstract
The goal of this work is to design a machine translation system for a low-resource family of dialects, collectively known as Swiss German. We list the parallel resources that we collected and present three strategies for normalizing Swiss German input in order to address its regional and spelling diversity. We show that character-based neural MT is the best solution for text normalization and that, in combination with phrase-based statistical MT, we reach a BLEU score of 36%. This value, however, is shown to decrease as the testing dialect becomes more remote from the training one.
Similar resources
ArchiMob - A Corpus of Spoken Swiss German
Swiss dialects of German are, unlike most dialects of well standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German is still a considerable challenge due to the fact that it is mostly a spoken variety rarely recorded and that it is subject to considerable regional variation. This paper presents a freely available general-purpose corp...
Automatic speech recognition and translation of a Swiss German dialect: Walliserdeutsch
Walliserdeutsch is a Swiss German dialect spoken in the south west of Switzerland. To investigate the potential of automatic speech processing of Walliserdeutsch, a small database was collected based mainly on broadcast news from a local radio station. Experiments suggest that automatic speech recognition is feasible: use of another (Swiss German) database shows that the small data size lends i...
A Resource for Natural Language Processing of Swiss German Dialects
Since there are only a few resources for Swiss German dialects, we compiled a corpus of 115,000 tokens, manually annotated with PoS tags. The goal is to provide a basic data set for developing NLP applications for Swiss German. We extended the original corpus and improved its annotation consistency. Furthermore, we trained dialect-specific PoS-tagging models and implemented a baseline system for...
Rhythmic variability in Swiss German dialects
Speech rhythm can be measured acoustically in terms of durational characteristics of consonantal and vocalic intervals. The present paper investigated how acoustically measurable rhythm varies across dialects of Swiss German. Rhythmic measurements (%V, ΔC, ΔV, varcoC, varcoV, rPVI-C, nPVI-C, nPVI-V) were carried out on four sentences of six speakers from eight Swiss dialects. Results indicate th...
Verb Clusters in Continental West Germanic Dialects
The Continental West Germanic Languages include the standard varieties of Dutch, Frisian, and High German, as well as a large number of non-standard varieties, the more familiar of which are the dialects spoken in Belgium and the South of the Netherlands (Flemish, Brabantish, Limburgian), Northern Germany (Low German), the Rhine Valley (Luxemburgish), South-Eastern Germany and Austria (e.g. Bav...
Journal title:
- CoRR
Volume abs/1710.11035, issue -
Pages -
Publication date 2017